Recording of patients’ mental health and quality of life-related outcomes in primary care: a cross-sectional study in the UK

Objective To compare patient-reported anxiety, depression and quality-of-life (QoL) outcomes with data registered in patients’ primary care electronic health record (EHR).
Design Cross-sectional study.
Setting Primary care in the UK.
Participants A convenience sample of 608 women registered in the Clinical Practice Research Datalink GOLD primary care database (data from a previous study on 356 breast cancer survivors (8.1 years postdiagnosis) and 252 women with no prior cancer).
Outcome measures Patient-reported data on anxiety, depression and QoL, collected through postal questionnaires, and compared with coded information in EHR up to 2 years prior.
Results Abnormal anxiety symptoms were reported by 118 of 599 women who answered the relevant questions (21%); 59/118 (50%) had general practitioner (GP)-recorded anxiolytic/antidepressant use, and 2 (1.6%) had anxiety coded in the EHR. 26/601 women (11%) reported depression symptoms, of whom 17 (65.4%) had GP-recorded antidepressant use and none had depression coded. 65 of 123 women reporting distress on the pain QoL domain (52.8%) had a corresponding record in the EHR <3 months before and 92 (74.8%) <24 months before. No patients reporting fatigue (n=157), sexual health problems (156), social avoidance (82) or cognitive problems (93) had corresponding codes in the EHR. There were no meaningful differences in the concordance results between breast cancer survivors and women with no history of cancer.
Conclusion Many patients reporting mental health and QoL problems had no record of this in coded primary care data. This finding suggests that coded data does not fully reflect the burden of disease. Further research is needed to understand whether or not GPs are aware of patient distress in cases where codes have not been recorded.


Summary
We aim to assess the quality of life (QoL), and the presence and severity of anxiety and depressive symptoms, in women who were diagnosed with breast cancer ≥1 year ago, compared with women who did not have cancer.
The Clinical Practice Research Datalink (CPRD) primary care database will be used to select a random sample of breast cancer survivors (≥1 year), whose general practitioner (GP) agrees to participate in the study (see below), and who were registered with the practice for ≥1 year before and after the breast cancer diagnosis. Age-matched women who never had cancer will be randomly selected from the same practice. Staff at each practice will mail the study materials to the eligible women, who will complete the questionnaires and send those to the CPRD Intervention Studies Team for processing.
In addition, a secondary objective of this study is to assess whether patient-reported outcomes (PROs) can be reasonably studied using electronic health records (EHR), as this would require fewer resources. For this, the EHR of the participating women will be collated from the CPRD primary care database and the results will be compared with those reported by the patients.

Background
Breast cancer is the most common malignancy diagnosed in women in the United Kingdom (UK), excluding non-melanoma skin cancer [1]. The five-year age-standardised net survival for patients diagnosed with breast cancer in 2005-09 was 81% [2]. Breast cancer survivors are the largest group of cancer survivors in the UK [3,4]: approximately 570,000 women were estimated to be living with or beyond breast cancer in 2010; this corresponds to 1,803 per 100,000 women [4]. The increasing trends in incidence and survival [1,2] suggest that the number of breast cancer survivors will continue to increase in the next decades [4].
Even though women now live longer after the breast cancer diagnosis, the disease is perceived as life threatening and a major cause of emotional distress [5]. Common reactions to the diagnosis include anxiety, feelings of loneliness, fear of death, hopelessness, anger, suicidal thoughts and existential issues [6,7]. In addition to the sorrow of the diagnosis, most women undergo a long and complex journey of aggressive treatments [8] with iatrogenic effects that are likely to have a long-term negative impact on their mental health and health-related quality of life (HRQoL) [9,10]. For example, surgery for tumour removal and lymph node status assessment may cause lymphoedema [11] and/or persistent pain [12], in addition to a life-long scar, which may change women's body image [13]. Chemotherapy may result in cognitive impairments [14,15] and/or cause amenorrhea in pre-menopausal women, bringing fertility concerns (for women who want children) and vasomotor symptoms such as hot flushes, night sweats, breast sensitivity and/or pain [16,17]. In the long term, women also have to re-adapt to social and intimate relationships (including with their spouse [...]
[...] occurring in a large number of general practices in the UK. This database currently includes data for more than 11.3 million patients, from over 600 general practices [29]. The cohort of cancer survivors in this database is one of the largest in the world with data prospectively and routinely collected at primary care level. As most mental disorders are also managed at primary care level [30,31], the CPRD primary care database offers a unique opportunity to study long-term mental disorders in women who have had breast cancer. The information available for some domains of HRQoL may also represent an opportunity to study what are normally patient-reported outcomes at a much lower cost, but there has been no study evaluating the extent to which EHR data can be reasonably used to study HRQoL.
This study showed that breast cancer survivors had significantly increased odds of being prescribed antidepressants and anxiolytics but not of consulting for anxiety or depression, compared to women who did not have breast cancer [27]. The interpretation of these results is not straightforward because: 1) patients consulting for anxiety or depression are likely to represent the most severe cases, as these disorders, especially in the subthreshold or milder severities, are often undiagnosed [31] and their burden underestimated; 2) cancer survivors may have more contact with the health services and be therefore more likely to be diagnosed and/or treated for anxiety or depressive symptoms, compared to women who did not have breast cancer; 3) antidepressants may also be prescribed to breast cancer survivors as treatment for hot flushes [32], one of the commonest side effects of endocrine treatments [33], and it is unclear if the frequency of prescription of antidepressants for hot flushes differs between women who have had breast cancer and women who never had cancer. Considering this, it is unclear how well the data registered in the EHR represent the burden of anxiety and depressive conditions in the population. In addition, a population-based cohort study conducted in Denmark described a significantly increased risk of depression in the first years after diagnosis, whose magnitude and significance reduced over time [34].
Corresponding estimates for the five years after the diagnosis are not available in the UK.
The aim of this study is to investigate the HRQoL, and the presence and severity of anxiety and depressive symptoms, in breast cancer survivors (>1 year) and in women who did not have cancer. A secondary objective of this study is to compare the outcomes reported by the patients to the data available in the EHR. In doing so, we will assess the feasibility of using EHR to study outcomes that are usually reported directly by patients.

Aims
The primary aim of this study is to investigate the health-related quality of life (HRQoL), and the presence and severity of anxiety and depressive symptoms, in female breast cancer survivors (>1 year) compared to women who did not have cancer.
The secondary aim is to assess the feasibility of using EHR data to study outcomes that are usually reported directly by patients.

Study type
Descriptive.

Study site
England, Wales, Scotland and Northern Ireland.

Study design
Cross-sectional.

Study population
Women aged 18 to 80 years, diagnosed with a first primary breast cancer one year or more before the recruitment date, and who had been registered for at least two years with a general practice contributing 'up to standard' data to CPRD at the time of recruitment.

Comparison group
Adult women (18-80 years) without a previous cancer diagnosis, selected from the same primary care practices as the cancer patients.

Recruitment of the participants
Participants will be recruited, via their GP, from primary care practices contributing data to the CPRD primary care database. GPs working in practices considered 'active' (i.e. contributing data to CPRD at the time of recruitment), and whose data quality at practice level has been judged as 'up to standard' by the CPRD internal quality procedures, will be invited by the CPRD Intervention Studies Team to participate in the study. Refusal to participate in the study will be recorded. The EHR of the women registered with the GPs who agree to participate in the study will be collated. We will create a list of women who have breast cancer recorded in the EHR using the list of Read codes provided in Appendix 1. We will then restrict the list to women aged 18-80 years, who were registered with the same primary care practice for at least one year before the breast cancer diagnosis, and who are currently alive, registered with the same practice, and have passed the first anniversary of their cancer diagnosis. A list of Read codes for other cancers [35] will be used to further exclude women who have had any other malignancy diagnosed before or after the breast cancer.

Breast cancer survivors
A random list of potentially eligible breast cancer survivors from each general practice will be selected. The number of women to be randomly selected from each practice will be calculated as the total number of women necessary for the study multiplied by the number of breast cancer survivors in the practice divided by the total number of potentially eligible breast cancer survivors in all practices.
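The proportional allocation described above can be illustrated with a minimal sketch. The function name `allocate_sample`, the practice labels and the counts are hypothetical, and rounding to the nearest whole woman is an assumption, since the protocol does not state how fractional allocations are handled.

```python
def allocate_sample(total_needed, eligible_by_practice):
    """Proportional allocation of a sample across practices.

    For each practice: total_needed * eligible_in_practice / eligible_overall.
    Rounding to the nearest integer is an assumption of this sketch, so the
    allocations may not always sum exactly to total_needed.
    """
    overall = sum(eligible_by_practice.values())
    if overall == 0:
        raise ValueError("no potentially eligible women in any practice")
    return {
        practice: round(total_needed * count / overall)
        for practice, count in eligible_by_practice.items()
    }


# Hypothetical counts of potentially eligible survivors per practice.
eligible = {"practice_A": 50, "practice_B": 30, "practice_C": 20}
allocation = allocate_sample(20, eligible)
print(allocation)
```

The same calculation would apply to the comparison group, with the counts of women without cancer in place of the counts of survivors.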
The list of potentially eligible breast cancer survivors will be provided to the GP, and s/he will apply the following exclusion criteria: a) The woman had another cancer (not detected in the EHR), or has been treated for a non-invasive breast tumour; b) The woman is considered unable to complete a self-administered questionnaire written in English for any reason.
The number of women excluded by the GP under each criterion will be recorded. Breast cancer survivors not excluded will be eligible for the study and invited to participate.
Women who did not have cancer
A list of Read codes [35] will be used to exclude patients who have had cancer from the list of patients attending the same practices as the cancer survivors. In addition, patients who have not been registered continuously with the practice for the last two years, or who are outside the age range 18-80 years, will be excluded. Women remaining on the list are potentially eligible.
The number of women to be selected from each practice will be calculated as: total number of women without cancer necessary for the study times the number of women without cancer in the practice divided by the total number of women without cancer in all practices.
For each practice, we will then calculate the proportion of breast cancer survivors in each age group. A list of potentially eligible controls will be created by randomly selecting women with the same age distribution as the breast cancer survivors of that same practice.
This list of potentially eligible controls will be sent to the GP, who will confirm that the women did not have cancer and apply exclusion criteria a) and b).
Women not excluded will be considered eligible controls and invited to participate in the study.

Health-related quality of life
Information on HRQoL will be collected using the Quality of Life in Adult Cancer Survivors Scale (QLACS) [36]. The QLACS was developed to take into account the specific needs of long-term cancer survivors, including issues that continue after treatment, new issues that arise during the period post-cancer, late physical effects of the cancer treatments and positive aspects of surviving cancer [36]. It includes 47 items, divided into 7 generic and 5 cancer-specific domains (Appendix 2).
Breast cancer survivors will be asked to reply to all 47 items of the QLACS (Appendix 3).
Women who never had cancer will reply to the 28 items of the generic domains (Appendix 4).

Anxiety and depression
Data on anxiety and depressive symptoms will be collected with the Hospital Anxiety and Depression Scale (HADS, Appendix 5) [37]. This is a 14-item self-reported screening tool for anxiety and depressive symptoms in the past week. It contains two sub-scales, one for anxiety (HADS-A) and another for depression (HADS-D), with 7 items each [37]. This scale has been validated for use in primary care [38] and has been used in primary care studies in the UK [39-41].

Clinical and socio-demographic data
Breast cancer survivors will be asked to provide information about the type of treatments received, the stage of their disease at diagnosis, the time since the last treatment (excluding long-term hormonal therapy), their menopausal status before and after the treatment, and how the cancer responded to the treatment (Appendix 6).
For all women, we will also collect data on potential confounders of the association between cancer history and mental health outcomes: education, ethnicity and social support (Appendix 7). Information on other potential confounders, such as co-morbidities or age at diagnosis will be obtained from the EHR.

Deprivation measures
The CPRD GOLD primary care data will be linked to the Index of Multiple Deprivation data.
The IMD is an ecological measure based on the premise that deprivation can be measured [...]. Based on the 2011 Census, there were 32,844 Lower-layer Super Output Areas (LSOA) in England. Mathematically, the IMD is calculated by using a set of indicators (at LSOA level) to produce information for seven domain indices that are related to material deprivation (income deprivation; employment deprivation; education, skills and training deprivation; health deprivation and disability; crime; barriers to housing and services; and living environment deprivation). The data from these seven domains are then combined using specified weights to produce a single measure of deprivation for each LSOA. The 32,844 LSOA are then sorted by measure of deprivation, and assigned a rank from 1 to 32,844, creating a relative measure of deprivation.
All GP practices contributing data to the CPRD GOLD primary care database can be assigned an IMD rank based on the GP practice postcode. This has been used in several studies as a proxy measure for socio-economic status at individual level because it is available for all patients, even though the ecological fallacy might apply (i.e. the individual experience may differ from that of the group). Patients can also be assigned an IMD rank based on their home address, but this is only available for the subset of patients that consent to the linkage scheme. We will request practice postcode level of IMD for all GP practices participating in the study, and patient postcode level of IMD for all potentially eligible patients (in 20-quantiles, which may be later combined into broader categories in analyses).

Proportion of participation and exclusions
The proportion of GPs and patients who agree to participate in the study will be calculated for the whole of the UK, by country within the UK, by region, and by quintiles of practice- and patient-postcode level of IMD.
The proportion of patients considered by the GP as ineligible will be reported separately for breast cancer survivors and women who did not have cancer.
The proportion of breast cancer survivors who agree to participate in the study will be calculated, as well as the proportion of women who did not have cancer. The denominator will include all women in each group to whom questionnaires were sent, even though we expect a small proportion of envelopes to be returned because the patient may have moved or died, or the address may not be correct.
The QLACS includes 19 items for 5 cancer-specific domains of HRQoL (Appendix 2). Answers are provided on an ordinal Likert-type scale, with values for individual items ranging from 1 to 7 [36]. For each breast cancer survivor, we will group the items by domain and calculate the sum of the individual scores under each domain [36]. All but one domain include 4 items; the "family distress" domain includes 3 items, and the sum of the individual scores will be rescaled to make the metric comparable with other domains. Values for each domain will range between 4 and 28. The range (minimum and maximum) of the scores will be reported for each domain, as well as the proportion of patients who score at the minimum and maximum values (floor and ceiling effects, respectively).
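As an illustration of the domain scoring, the sketch below sums the item scores of one domain and rescales a 3-item domain to the 4-item metric. Multiplying the sum by 4/3 is an assumption that reproduces the stated 4 to 28 range; the protocol does not spell out the rescaling. The function names are hypothetical.

```python
def domain_score(item_scores, target_items=4):
    """Sum the 1-7 item scores of one domain, rescaled to a 4-item metric.

    A 4-item domain is returned as its plain sum (range 4-28). A 3-item
    domain is multiplied by 4/3 so that its range is also 4-28
    (assumed rescaling, see lead-in).
    """
    if not item_scores:
        raise ValueError("domain has no items")
    if any(not 1 <= score <= 7 for score in item_scores):
        raise ValueError("item scores must lie between 1 and 7")
    return sum(item_scores) * target_items / len(item_scores)


def floor_ceiling(scores, minimum=4, maximum=28):
    """Proportion of respondents at the minimum (floor) and maximum (ceiling)."""
    n = len(scores)
    at_floor = sum(score == minimum for score in scores) / n
    at_ceiling = sum(score == maximum for score in scores) / n
    return at_floor, at_ceiling


# Hypothetical responses: two 3-item and two 4-item domain answers.
scores = [
    domain_score([1, 1, 1]),      # 3-item domain, lowest possible
    domain_score([7, 7, 7]),      # 3-item domain, highest possible
    domain_score([2, 3, 4, 5]),   # 4-item domain
    domain_score([1, 1, 1, 1]),   # 4-item domain, lowest possible
]
print(scores, floor_ceiling(scores))
```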
A mean or median score (depending on distribution) for each domain will be calculated from the individual-level sums of scores of the breast cancer survivors. Standard deviation will be calculated to quantify the dispersion of the data. The correlation coefficient among the mean scores of the domains will be reported.

Carreira H
A summary score for the cancer specific domains will be calculated by adding the mean/median scores of four domains ('financial problems', 'distress-family', 'appearance', and 'distress-recurrence'); the mean/median score for 'benefits from cancer' is not included.
We will use linear regression models to estimate the association between the cancer-specific HRQoL domain scores and patient factors, such as stage at diagnosis or type of surgery.
The dependent variable will be the sum of the individual items reported by each patient for that particular domain. The linear regression coefficients (β) from the regression models and the corresponding 95% confidence intervals will be reported.
The QLACS includes 28 items for 7 generic domains of HRQoL, with values for individual items ranging from 1 to 7 [36] (Appendix 4).
The items will be grouped by domain (Appendix 2), and we will calculate, for each woman, the sum of the individual scores under each domain [36]. For each group of participants (i.e. breast cancer survivors and women who did not have cancer), the range (minimum and maximum) of the scores will be reported for each domain, as well as the proportion of women who score at the minimum and maximum values of the domain.
A mean or median score, depending on the distribution of the data, will be obtained for each group of women, by calculating the mean/median of the sum of the scores for each woman in that group. The respective standard deviation will be reported.
A summary score for the generic domains will be calculated as the sum of the individual domain scores.
Student's two-sample t-test, or a non-parametric alternative if needed (i.e. the distribution-free Mann-Whitney test), will be used to assess the evidence for a difference in the summary scores for each domain between the two groups.
Linear regression will be used to evaluate the impact of cancer diagnosis on the mean scores of HRQoL, adjusting for potential confounders. The role of socio-economic and clinical variables will be explored. The model fit and the linear regression coefficients (β) will be reported as well as the 95% confidence intervals.
The Hospital Anxiety and Depression Scale contains two sub-scales, one for anxiety (HADS-A) and another for depression (HADS-D), with 7 items each [37]. Each item is rated from 0 to 3 and the total score for each sub-scale ranges between 0 and 21; higher scores represent more severe symptoms of depression or anxiety [37].
To evaluate the severity of the depressive and anxiety symptoms in each group, the mean or median score, as appropriate, will be calculated for each sub-scale. Student's t-test or the Mann-Whitney test will be used to compare the mean/median score of depressive and of anxiety symptoms between cancer survivors and women who did not have cancer.
To identify patients with clinically relevant symptoms of depression or anxiety, the authors of the scale propose the cut-off of 0-7 for non-cases, 8-10 for borderline cases and 11-21 for probable cases, in both subscales.
The proportion of patients falling into the three categories (non-case, borderline, probable case) will be estimated for breast cancer survivors and for controls.
A chi-squared test will be used to assess whether there is evidence of differences in the proportion of patients in these categories between the two groups. A test for trend will be used to evaluate if there are increasing changes over the categories in each group.
The participants will then be categorised as having or not having clinically relevant levels of depressive or anxiety symptoms (cut off >10). Logistic regression models will be used to estimate the association between breast cancer history and clinically relevant levels of anxiety, and breast cancer history and clinically relevant symptoms of depression. The impact of clinical and demographic variables will be explored in the regression models. Crude and adjusted odds ratios, and respective 95% confidence intervals, will be reported.
Alexander et al. [42] evaluated the performance of the HADS as a screening test for major depressive disorder and anxiety in breast cancer survivors who were between 3 months and 2 years after main treatment conclusion (gold standard: non-patient Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders (SCID)). Using the proposed cut-off of >10, the HADS-D had a sensitivity of 50% (95% confidence interval (95%CI): 27 to 73) and a specificity of 97% (95%CI: 93 to 99) [42]. The HADS-A had a sensitivity of 71% (95%CI: 30 to 95) and a specificity of 87% (95%CI: 81 to 91) [42]. Even though the optimal cut-off for this population has not been established, a sensitivity of 50% may be too low to be acceptable in clinical practice, and therefore we will conduct a sensitivity analysis considering the cut-off of ≥8 to classify women as having clinically relevant symptoms of anxiety or depression.
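The HADS categorisation and the two case definitions can be sketched as follows. The function names are hypothetical; the cut-offs are those stated in the text (0-7, 8-10 and 11-21 for the three categories, >10 for the main analysis and ≥8 for the sensitivity analysis).

```python
def hads_category(subscale_score):
    """Three-level classification of a HADS-A or HADS-D sub-scale score."""
    if not 0 <= subscale_score <= 21:
        raise ValueError("HADS sub-scale scores range from 0 to 21")
    if subscale_score <= 7:
        return "non-case"
    if subscale_score <= 10:
        return "borderline"
    return "probable case"


def clinically_relevant(subscale_score, minimum_case_score=11):
    """Binary case definition.

    The default of 11 corresponds to the main cut-off (>10); passing 8
    gives the sensitivity analysis (>=8).
    """
    return subscale_score >= minimum_case_score


for score in (7, 8, 10, 11):
    print(score, hads_category(score),
          clinically_relevant(score),
          clinically_relevant(score, minimum_case_score=8))
```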

Objective 4: To compare the information reported by the patients for HRQoL, and for depressive and anxiety symptoms, with the information registered in the EHR for similar constructs

HRQoL
The QLACS includes seven generic domains of HRQoL (Appendix 4). Of these, five are particularly suitable for comparison with the data recorded in the EHR because women with distressing levels for these domains may have visited their GP to seek help: 'negative feelings', 'cognitive problems', 'physical pain', 'sexual problems' and 'fatigue'. Read codes for the 'social avoidance' domain are also available, and therefore we will also include this domain.
For each woman, we will calculate the mean score for each domain (mean values will range between 1 and 7). Then, we will consider as reporting important levels of distress all women with a mean score of ≥5 (corresponding to replies of frequently, very often or always to most questions) in the domains of negative feelings, cognitive problems, physical pain, sexual problems and fatigue. Two sensitivity analyses will be conducted: 1) using a lower cut-off of ≥3 (corresponding to replies of sometimes and as often as not, in addition to replies of frequently, very often or always to most questions); 2) considering as exposed to important levels of distress all women who replied ≥5 to at least one item in the domain.
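The main definition of distress and the two sensitivity analyses can be expressed in a short sketch. The function name and the `rule` argument are hypothetical; the cut-offs are those given in the text.

```python
def reports_distress(item_scores, cutoff=5, rule="mean"):
    """Classify one woman's answers (1-7) for one QLACS domain.

    rule="mean":     distress if the mean item score is >= cutoff
                     (cutoff=5 in the main analysis, 3 in sensitivity analysis 1).
    rule="any_item": distress if at least one item is >= cutoff
                     (sensitivity analysis 2, with cutoff=5).
    """
    if not item_scores:
        raise ValueError("domain has no items")
    if rule == "mean":
        return sum(item_scores) / len(item_scores) >= cutoff
    if rule == "any_item":
        return any(score >= cutoff for score in item_scores)
    raise ValueError("rule must be 'mean' or 'any_item'")


# Hypothetical answers to a 4-item domain.
answers = [1, 1, 1, 5]
print(reports_distress(answers))                   # main analysis
print(reports_distress(answers, cutoff=3))         # sensitivity analysis 1
print(reports_distress(answers, rule="any_item"))  # sensitivity analysis 2
```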
To identify evidence of the corresponding outcomes in the EHR, we will produce a list of Read codes closely related to the QLACS items for each domain (Table 1). This list of Read codes will be used to identify women (who have had and who did not have breast cancer) with these outcomes registered in their EHR in the previous year (or since the first anniversary of diagnosis, if the cancer was diagnosed less than 2 years ago). We will estimate the proportion of women who reported distressing levels for these domains, and the proportion of women who have a recording of a similar construct in the EHR, separately for breast cancer survivors and for women who did not have cancer.
To estimate how much inquiring the patient adds to the information registered in the EHR, we will calculate the probability of: 1) having information for a particular domain registered in the EHR, among women who reported distressing levels for that domain (sensitivity); 2) not having any information registered in the EHR for a particular domain among women who did not report distressing levels for that domain (specificity); 3) reporting distressing levels for a particular domain among women who had information for that domain registered in the EHR (positive predictive value); 4) not reporting distressing levels for a particular domain among women who did not have data for that domain registered in the EHR (negative predictive value).
All probabilities will be calculated separately for breast cancer survivors and for women who did not have cancer.
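The four probabilities can be computed from paired indicators for each woman, taking the patient report as the reference and the EHR record as the test, as defined above. The sketch below is illustrative; the function name and the example data are hypothetical, and a measure is returned as `None` when its denominator is zero.

```python
def concordance(reported, recorded):
    """Sensitivity, specificity, PPV and NPV of the EHR against patient report.

    reported: booleans, True if the woman reported distressing levels.
    recorded: booleans, True if a corresponding code is in her EHR.
    """
    if len(reported) != len(recorded):
        raise ValueError("both lists must describe the same women")
    pairs = list(zip(reported, recorded))
    tp = sum(1 for rep, rec in pairs if rep and rec)
    fn = sum(1 for rep, rec in pairs if rep and not rec)
    fp = sum(1 for rep, rec in pairs if not rep and rec)
    tn = sum(1 for rep, rec in pairs if not rep and not rec)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical data for ten women in one group and one domain.
reported = [True, True, True, True, False, False, False, False, False, False]
recorded = [True, False, False, False, True, False, False, False, False, False]
result = concordance(reported, recorded)
print(result)
```

In the study, this calculation would be run separately for each domain and for each group of women.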

Anxiety and depression
The scores of the HADS-A and HADS-D will be used to classify women as having clinically relevant levels of anxiety and of depressive symptoms, respectively, using >10 as cut-off. The proportion of women scoring above this threshold will be calculated.
Women with a diagnosis of an anxiety and/or depressive disorder will be identified in the EHR through a list of Read codes. This list will be based on a systematic review of the literature to identify mental disorders in primary care databases. Women with a Read code for a depressive or anxiety disorder diagnosed in the last year will be considered depressed or anxious. A sensitivity analysis will include Read codes for symptoms of depression and/or anxiety, to account for the difficulties in the diagnosis of these conditions. We will calculate, for each group of women and for each disorder, the probability of: 1) having a diagnosis of anxiety/depression registered in the EHR among women who scored above the threshold in the HADS scale (sensitivity); 2) not having a diagnosis of anxiety/depression registered in the EHR among women who did not score above the threshold in the HADS scale (specificity); 3) scoring above the threshold in the HADS scale among women who had a diagnosis of anxiety/depression recorded in the EHR (positive predictive value); 4) not scoring above the threshold in the HADS scale among women who did not have a diagnosis of anxiety/depression recorded in the EHR (negative predictive value).

Plan for addressing missing data
We estimate that 5% of the women will have missing data for at least one item of the QLACS. This is a conservative estimate based on literature (the highest proportion of missing items was 3.2% [43]). The HADS has been shown to have excellent acceptability [37] and the proportion of missing items is usually small.
We will explore the pattern of missingness of the items by demographic and clinical variables.
For that purpose, a variable will be created to denote records with incomplete information and we will explore the association between this variable and clinical and demographic variables.
If the missingness can be explained by the other variables in the dataset, we will consider that it is missing at random, and specify a multiple imputation model to better represent the distribution from which the missing data came.

Sample size
We estimate that a sample of 260 breast cancer survivors and 260 women who did not have cancer is required to detect differences of the size reported in the literature. As the participation rate in this type of study has been low (approximately 20%), we believe that 1,400 women in each group need to be invited.
HRQoL
Table 2 provides details of the sample size calculation for the comparison of the summary scores of HRQoL, and of the mean scores of the generic domains of HRQoL, between breast cancer survivors and women who did not have cancer.
* Calculated as the estimated sample size rounded upwards to the next 10 subjects (to take into account the uncertainty of the estimation process), divided by 0.2 (the estimated proportion of participation), plus 100 patients to account for other variables to be studied.
1 Women diagnosed with breast cancer at 18-24 months [44].
2 Women diagnosed with breast cancer at 5 years or more [36,43].
The summary mean scores for the generic domains of the QLACS among breast cancer survivors were obtained from the literature [36,43,44]. The mean/median scores of the generic domains among women who did not have cancer have not been reported. However, in a study involving long-term survivors of breast, bladder, head and neck, gynaecologic, prostate and colorectal cancer [36], patients with colorectal cancer had the lowest summary score (indicating better HRQoL) for the generic domains of HRQoL (mean 60.9, SD=21.5). We used this score as a conservative estimate of the summary score of HRQoL in the general population, assuming that women who have never had cancer will not have worse HRQoL than the cancer patients who experience the best HRQoL. The same assumption was applied to estimate the sample size for the specific domains of HRQoL.
Table 3 provides sample size estimates for the comparison of the mean scores of the two subscales of the HADS. As shown in the table, one study found a difference in mean HADS-Depression scores of just 0.6; to detect such a small difference would require 447 women per group, which would be beyond available resources. However, another study has calculated that differences of less than 1.4 in mean HADS-Depression scores are not clinically important [45], and only around 75 patients per group would be required to detect differences above this level. For anxiety we would require 253 women per group to detect the minimum previously observed differences on the HADS scale.
According to the calculations, over 13,000 breast cancer survivors and 13,000 women who did not have cancer would be needed to compare the prevalence of depression between the two groups of women. Recruiting more than 1,500 women for this study is not feasible, and therefore we chose the sample size necessary to compare the mean scores of anxiety and depression between the two groups (n=1,400 in each group, as outlined above and in Table 3).
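The arithmetic behind the number of women to invite (required sample rounded up to the next 10, divided by the expected participation of 0.2, plus 100) can be checked with a short sketch. The function name is hypothetical, participation is passed as a whole percentage to keep the arithmetic in integers, and treating a value that is already a multiple of 10 as unchanged is an assumption of this sketch.

```python
import math


def women_to_invite(required_per_group, participation_pct=20, extra=100):
    """Number of women to invite per group.

    1) round the required sample size up to the next multiple of 10;
    2) divide by the expected participation (20% by default);
    3) add a fixed number of extra patients (100 by default).
    """
    rounded = math.ceil(required_per_group / 10) * 10
    # Integer ceiling division avoids floating-point rounding issues.
    invited = -(-rounded * 100 // participation_pct)
    return invited + extra


# 253 women per group (HADS anxiety) -> 260 -> 1,300 -> 1,400 to invite.
print(women_to_invite(253))
```

With 253 women required per group for the anxiety sub-scale, this reproduces the 1,400 women per group stated in the text.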

Feasibility counts
A total of 43,704 women with breast cancer, and who were at least one year post-diagnosis, were identified in the July 2015 cut of the CPRD primary care database. Of these, 21,564 women had acceptable records from practices contributing with 'up to standard' data. A total of 8,763 women were still registered in practices that contributed with data to CPRD during the year of 2016, of which 7,498 (86%) were aged between 18 and 80 years old. Table 5 describes the distribution of the patients by region within England.

Pilot study
We will invite all GPs working in practices contributing with 'up to standard' data to CPRD at the time of recruitment to participate in the study.
Packages containing paper questionnaires will be sent to 140 breast cancer survivors and 140 women who did not have cancer (10% of those to be invited), randomly selected from the list of patients attending the first practices to sign up for the study. The pilot phase will run for 1 month. After that time, we will estimate: 1) the proportion of participation in each group; 2) the age distribution of the participants in each group; 3) the number of questionnaires with missing items.
Sample size calculations will be revised, if necessary. Afterwards, paper questionnaires will be sent out to the remainder of women to be invited, up to the estimated sample size.
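The random selection of the pilot sample and the first pilot estimate (proportion of participation) could be carried out along the following lines. This is a minimal sketch under stated assumptions, not the study's procedure: the function names, the fixed seed and the patient identifiers are all hypothetical, and the protocol does not specify the software to be used.

```python
import random


def select_pilot_sample(patient_ids, n_pilot=140, seed=2016):
    """Draw a simple random sample, without replacement, of n_pilot patients.

    The seed is a hypothetical choice that makes the draw reproducible.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(patient_ids), n_pilot))


def participation_proportion(n_returned, n_sent):
    """Proportion of invited women who returned a questionnaire."""
    return n_returned / n_sent


# Hypothetical identifiers for eligible women in one study group.
eligible = range(1, 1401)
pilot = select_pilot_sample(eligible)
print(len(pilot), "women selected for the pilot")
```

The same call would be made separately for the breast cancer survivors and for the comparison group, so that 140 women are drawn from each.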

Limitations of the study design, data sources and analytical methods
We will use the CPRD primary care database to classify women as exposed or not to breast cancer. CPRD has been shown to capture more than 90% of the cancer diagnoses registered in the cancer registries [49]. This is considered acceptable for this project, even though a small proportion of the women may be incorrectly classified as unexposed. We will request that the GP reviews the list of patients to exclude potentially misclassified cases.
We expect a substantial proportion of patients to decline to participate in the study, as shown by the proportion of participation in previous studies. Selection bias may occur if the patients who agree to participate in the study differ systematically from those who do not. We will compare the demographic characteristics of the women who participate in the study with the broad characteristics of the women who had breast cancer in the CPRD primary care database. We have also assumed a similar participation rate by age group among women with and without breast cancer. We will compare the age distribution of the final samples and take age into account in multivariate analyses if necessary.
Women who are unable to complete a self-administered questionnaire due to advanced disease (e.g. terminally ill, patients with dementia or severe mental illnesses) will be excluded from the study. Therefore, the generalizability of our results will be limited to women with relatively good cognitive function.
The QLACS was validated in the United States but not in the UK population of cancer survivors. However, no translation is required and the entire scale will be applied, which makes substantial bias unlikely. This study will have limited power to detect a strong association between having had breast cancer and depression as defined by the cut-offs of the HADS scale. Our primary outcome will be the difference in the mean scores of each subscale, for which this study will have enough power.

Patient or user group involvement
Two women who never had cancer reviewed the invitation letter, participant information sheets and questionnaires for women in the non-cancer comparison group.
Breast cancer survivors identified through the Independent Cancer Patients' Voice (a patient advocate group and charity) reviewed the materials for breast cancer survivors. Comments from each group were incorporated into the study materials.
We will also ask selected members of the public and breast cancer survivors to comment on the report produced to share the study results before making it available. We plan to disseminate the results through publication of an article in a peer-reviewed scientific journal. We will also present preliminary findings at scientific meetings.
To share the results with the general public, we will make the study results publicly available online. We will create a study webpage on the website of the London School of Hygiene & Tropical Medicine. The address of this webpage will be included in the participant information packs. A summary of findings from the study will be posted on the study webpage in due course. Anyone visiting this webpage (whether a participant, invitee, general practitioner or any interested member of the public) will be able to provide a contact email address through the webpage to subscribe for updates. The study researchers will use these contact email addresses for the sole purpose of letting interested parties know about updates to the study webpage.